1 Introduction

This document provides an overview of the bayesOmic package that will be available at Bioconductor (http://bioconductor.org/). The package implements a Bayesian approach for omic association studies. We propose a shared-component model to tease out the feature information (e.g. SNPs, CNVs, genes, transcripts, CpGs) that is common to cases and controls from the one that is specific to cases. This allows to detect the SNPs (bayesSNPassoc function) or any continuos feature (bayesOmicAssoc function) that show the strongest association with the disease. The model can nevertheless be applied to more than one disease. More detailed information about the model and assumptions are given in J. J. Abellan, Abellan, and Gonzalez (2010) and González, Abellán, and Abellán (2012) in the case of analyzing SNPs or CNVs, respectively. We illustrate how to analyze SNP data by using a synthetic data set of a targeted study (e.g. a selection of SNPs), GWAS data comparing different populations from 1000 Genomes project (To be supplied) and expression data (i.e. a RangedSummarizedExperiment object for an RNA-Seq experiment) on four human airway smooth muscle cell lines.

2 Getting started

The bayesOmic package uses JAGS, a program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation Plummer and others (2003), to estimate model parameters. The current implementation of bayesOmic is based on JAGS version 4.3.0. JAGS has an R interface rjags that is used by the package (rjags version 4-9).

The development version of bayesOmic can be installed from our GitHub repository

library(devtools)
install_github("isglobal-brge/bayesOmic")

Then, the package is loaded as usual

library(bayesOmic)

3 Analysis of SNP data

Let us illustrate how to apply our model to a GWAS study as described in J. J. Abellan, Abellan, and Gonzalez (2010). The examples correspond to a simulated data including information about 72 SNPs. It contains information of 800 individuals divided in 4 populations: controls and 3 type of cases (M1, M2, and M3). We simulated SNPs that are specific from gene 1 to be associated with M1 and SNPs from gene 3 to M3.

Let us start by describing how to get the data into R

3.1 The data

Data can be imported from a text file or can be loaded using snpMatrix package (to be supplied). We provide a simulated example that can be loaded by typing:

data(sim.data)

This is a simulated data having 72 SNPs and one column with case/control status

dim(sim.data)
[1] 800  73
sim.data[1:5, 1:5]
     caco snp74519580 snp30066728 snp32588600 snp57373945
1 Control          AB          AA          BB          AB
2 Control          AB          AB          AA          AB
3 Control          AA          AB          AB          AA
4 Control          AB          AB          AA          AA
5 Control          AB          AA          AA          AA

The package requires to have the case-control status and the SNPs in to different objects:

y <- sim.data$caco
SNPs <- sim.data[, -1]

It is worth to notice that GWAS data is normally in PLINK format. The Genomic Data Storage format is ….. To be supplied (modify function)

3.2 Model parameter estimates

The model can be fitted by using the function bayesSNPassoc as follows:

mod <- bayesSNPassoc(group = y, data=SNPs, sep.allele="", method="JAGS")

The function bayesSNPassoc prepares the data (NOTE: it aggregates the number of rare alleles by the grouping variable. When SNP data is a data.frame it uses SNPassoc package to get the number of rare alleles. In that case the sep.allele argument defines how alleles are separated - sep.allele="" is the default value). Then the function calls rjags to estimate model parameters. We have set up the following default arguments to be passed through rjags functions:

args(bayesSNPassoc)
function (group, data, sep.allele = "", annotation, chr, call.rate = 0.9, 
    min.freq = 0.05, sig.level = 0.05, method = "inla", n.iter.burn.in = 1000, 
    n.iter = 5000, thin = 10, n.chain = 2, ...) 
NULL

Notice that other arguments realated to MCMC estimation using JAGS can be passed through this function. More details about them can be obtained at http://calvin.iarc.fr/ martyn/software/jags/.

The function also has two arguments to perform a simple quality control. The argument call.rate stands for the call rate of a given individual (e.g. proportion of non-missing SNPs, default is 0.9) and min.freq provides the minimum allele frequency allowed for a given SNP (default is 0.05).

3.3 Checking convergence

Before interpreting the simulations obtained from de a posteriori distri bution (e.g. model parameters), Markov chains convergence migth be verified. This can be done by using the function checkConvergence. This function has an argument called type that defines the type of plot to be obtained. When type="Markov chain" (default value) the function calls to plot.mcmc() function from package coda Ontherwise, Gelman-Rubin plots are displayed. The function checkConvergence has another argument, parameter, to indicate the model parameter to be summaryzed. The default is alpha. For example, Figure can be obtained by executing:

checkConvergence(mod)
Chains convergence of `alpha` parameter for SNP model of 'sim.data'.

Figure 1: Chains convergence of alpha parameter for SNP model of ‘sim.data’

Other model parameters (Figure \ref{fig:checkConvergence2) are summaryzed by changing the argument called parameter.

checkConvergence(mod, parameter = "beta")
Chains convergence of `alpha` parameter for SNP model of 'sim.data'.

Figure 2: Chains convergence of alpha parameter for SNP model of ‘sim.data’

The trace plot can be obtained by

checkConvergence(mod, type = "trace")
Chains convergence of `alpha` parameter for SNP model of 'sim.data' using Gelman-Rubin method.

Figure 3: Chains convergence of alpha parameter for SNP model of ‘sim.data’ using Gelman-Rubin method

The JAGS estimation can be very time consuming. INLA (Integrated Nested Laplace Approximation) can be used instead

mod.inla <- bayesSNPassoc(group = y, data=SNPs, sep.allele="", method="INLA")

The print() and plot() functions also work with this method

This are the shared components

plot(mod.inla, type="shared")

and this the specific

plot(mod.inla, type="specific")

3.4 Results

Model parameters (intercept and shared component) can be obtained by using print function:

mod

 Intercepts (alpha): 
                         2.5%        50%      97.5%
Control -0.8453899 -0.8692369 -0.8454839 -0.8212086
M1      -0.7903692 -0.8410633 -0.7895358 -0.7412849
M2      -0.8351066 -0.8612506 -0.8348243 -0.8094776
M3      -0.7754098 -0.8237909 -0.7770748 -0.7237574

 Coefficients of shared components (beta): 
                     2.5%        50%     97.5%
M1 1.4383355 1.235339e-08 0.31073920 11.639091
M2 0.4755054 3.245502e-10 0.08822252  2.736444
M3 1.8677611 6.676204e-09 0.39134913 13.082979

 Use 'getSpecific()' and 'getShared()' functions to get specific or shared components, respectively. 

On the other hand, specific and shared components can be visualize by:

plot(mod, type="specific")

and

plot(mod, type = "shared")

respectively.

Finally, a hierarchical clustering can be performed by using the predicted probabilities by typing:

makeHeatmap(mod)

The figure shows a Heatmap were we can observe that groups M1 and M3 are different from controls and group M2.

4 Analysis of CNV data

We provide a real data example that can be loaded by typing:

data(armengol)

Model parameter estimates are obtained using the function bayesOmicAssoc by executing:

mod.CNV <- bayesOmicAssoc(group="pop", data=armengol)

The intercept ans coefficient of shared components of populations can be obtained by:

mod.CNV

 Intercepts (alpha): 
                 2.5%      50%    97.5%
CEU 1.963230 1.900412 1.962320 2.036787
CHB 1.980471 1.928113 1.981505 2.023672
YRI 1.994390 1.948687 1.995756 2.034592

 Coefficients of shared components (beta): 
                        2.5%         50%      97.5%
CEU -0.003547929 -0.34902437  0.01602909 0.37232342
CHB  0.012235748 -0.18975163  0.04219106 0.19559769
YRI -0.007897079 -0.09120255 -0.01331183 0.06396489

 Use 'getSpecific()' and 'getShared()' functions to get specific or shared components, respectively. 

We can check model convergence by

checkConvergence(mod.CNV)

checkConvergence(mod.CNV, type="trace")

The shared components are

plot(mod.CNV, type="shared")

The Specific components for each population can be obtained by typing:

plot(mod.CNV, type="specific")

We can obtain the name of the specific features of a given population by

getSpecific(mod.CNV, group="YRI")
                          0.02083333%        50%    99.97917% sig
A_14_P133280               0.18845241  0.3757252  0.551398456   1
Chr16_70653608             0.16521939  0.3567767  0.517470120   1
Chr22_22690592            -1.08263271 -0.9346243 -0.703315278  -1
Chr3_46771035             -1.23736689 -1.0565910 -0.815555478  -1
Chr3_164046699            -0.56913859 -0.3663315 -0.147128074  -1
Chr6_32594054             -1.44036759 -1.2597585 -1.018248402  -1
Chr7_141965669             0.23258746  0.5614095  0.702714180   1
Chr1_193470134            -1.26381352 -0.9798692 -0.748392943  -1
Chr9_Pop_1                 0.12200154  0.3227919  0.495663848   1
Chr15_Pop_1                0.05656680  0.2363705  0.413002858   1
Chr1B_Pop_1                0.61187254  0.9533931  1.137132224   1
Chr16_Pop_1                0.11959767  0.2886871  0.440029242   1
Chr13_Pop_2               -1.26638780 -1.0673508 -0.860049649  -1
Chr15_Pop_2               -0.70157723 -0.4421261 -0.155667714  -1
Chr1B_Pop_2               -1.25738914 -1.0607002 -0.806091356  -1
Chr22_Pop_2                0.18955438  0.3759284  0.559668906   1
Chr16_Pop_2                0.39454559  0.7717614  0.961744645   1
Chr14_Pop_1               -0.38825197 -0.2424276 -0.001948421  -1
Chr14_Pop_2                0.01802474  0.2249488  0.406063662   1
chr17_420_B               -0.69969064 -0.4758211 -0.270970698  -1
A_14_P127743               0.11448146  0.4845810  0.643373741   1
A_14_P126481               0.11941396  0.5329414  0.766685591   1
chr17_335_A                0.68280135  0.8967815  1.088850265   1
chr1_723_B                -0.95357575 -0.7694218 -0.373258176  -1
chr17_576_B               -1.02370573 -0.7175496 -0.481911087  -1
chr19_8743145_8784792_B   -0.87422378 -0.6373753 -0.366011338  -1
chr17_42973027_43038163_A -0.97373829 -0.7655191 -0.563640810  -1
chr5_150_A                -0.96944956 -0.6375767 -0.389076625  -1
chr5_150_B                -1.07444547 -0.9112062 -0.720817903  -1
chr7_143_A                -0.43812801 -0.2659312 -0.053226534  -1
chr12_111_A               -0.61933486 -0.4481253 -0.071816101  -1
chr12_111_B               -1.00907552 -0.8049506 -0.590159226  -1
chr16_543_A                0.75241065  0.9579340  1.134285249   1
chr17_415_A                0.04738639  0.2986104  0.503667435   1
chr5_704_A                 0.51322367  0.7545153  0.968261105   1
chr5_704_B                 0.56210398  0.8396936  1.063031532   1
chr17_429_A2               0.09920391  0.4137488  0.743640914   1
chr17_429_B2               0.54696027  0.7407740  0.993223187   1

The heatmap is obtained by:

makeHeatmap(mod.CNV, quantiles=c(.025, 0.2, 0.5, 0.8, 0.975))

5 Analysis of expression data (SummarizedExperiment objects)

Dataset can be loaded by typing:

library(airway)
data(airway)

We can analyse a specific genomic range of interest (i.e. chromosome 1, positions 1 to 10000000). We define the roi by executing:

library(GenomicRanges)
library(IRanges)
roi <- GRanges(seqnames = "1", ranges= IRanges(1,10000000))

Model parameter estimates are obtained using the function bayesOmicAssoc with the argument roi by executing:

mod.Expr <- bayesOmicAssoc(group="dex", data=airway, roi = roi)

The intercept ans coefficient of shared components of populations can be obtained by:

mod.Expr

 Intercepts (alpha): 
                   2.5%      50%    97.5%
trt   65.38641 59.02911 64.70174 73.11770
untrt 67.72806 60.76016 67.39973 74.87659

 Coefficients of shared components (beta): 
                      2.5%       50%    97.5%
trt   -33.630672 -1173.774 -33.64596 1106.829
untrt  -9.640538 -1133.550 -10.45542 1116.820

 Use 'getSpecific()' and 'getShared()' functions to get specific or shared components, respectively. 

We can check model convergence by

checkConvergence(mod.Expr)

checkConvergence(mod.Expr, type="trace")

The shared components are

plot(mod.Expr, type="shared")

The Specific components for each population can be obtained by typing:

plot(mod.Expr, type="specific")

We can obtain the name of the specific features of a given population by

getSpecific(mod.Expr, group="trt")
                0.007598784%      50% 99.9924% sig
ENSG00000074800    2029.6426 2212.090 2370.688   1
ENSG00000116285     852.5968 2377.657 3856.968   1

The heatmap is obtained by:

makeHeatmap(mod.Expr, quantiles=c(.025, 0.2, 0.5, 0.8, 0.975))

6 Acknowledgments

This work has been partly supported by ….

References

Abellan, Juan J, Carlos Abellan, and Juan R Gonzalez. 2010. “A Bayesian Shared Component Model for Genetic Association Studies.” bepress.

González, Juan R, Carlos Abellán, and Juan J Abellán. 2012. “Bayesian Model to Detect Phenotype-Specific Genes for Copy Number Data.” BMC Bioinformatics 13 (1). BioMed Central: 130.

Plummer, Martyn, and others. 2003. “JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling.” In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 124:10. 125. Vienna, Austria.